Podoviruses that infect marine picocyanobacteria are abundant and could play a significant role in 
regulating host populations because of their specific phage-host relationship. Genome sequencing of 
cyanophages has revealed that many marine cyanophages encode photosynthetic genes such as psbA. It 
appears that psbA is present only in certain groups of cyanopodovirus isolates. To better understand 
the prevalence of psbA in cyanobacterial podoviruses, we searched four marine metagenomic databases (GOS, 
BATS, HOT and MarineVirome). Our study suggests that 89% of the cyanopodovirus scaffolds recruited from 
the GOS database contain the psbA gene, supporting the ecological relevance of this photosynthesis gene 
for surface oceanic cyanophages. The divergence between Clade A and Clade B is consistent with the recent 
finding of two major groups of cyanopodoviruses. The data also show that Clade B cyanopodoviruses dominate 
surface ocean water, while Clade A cyanopodoviruses become more important in coastal and estuarine 
environments. 



Viruses are abundant in the ocean and can influence the population dynamics and genetic diversity of their 
hosts 1-3. Cyanophages are a specific group of viruses that infect cyanobacteria, mainly 
Prochlorococcus and Synechococcus. Many cyanophages have been isolated, and all known marine 
cyanophages belong to three phage families: Myoviridae, Siphoviridae and Podoviridae 4,10. Recent studies showed 
that cyanopodoviruses might make up 50% of the cyanophage community in the sea 11,12, suggesting that 
cyanopodoviruses interact actively with cyanobacteria in the marine environment. 

Currently, nearly 40 cyanophage genomes have been sequenced, and half of them are cyanomyoviruses. 
Cyanomyoviruses have relatively large genomes and have acquired many accessory metabolic genes via horizontal 
gene transfer (HGT), which constitute a large reservoir of genetic diversity 13-19. Five genome sequences of 
cyanosiphoviruses have been reported, with genome sizes ranging from 30,332 to 105,532 bp 20,21. Compared with 
cyanomyoviruses and cyanosiphoviruses, cyanopodoviruses have a relatively conserved genome size, ranging 
from 42,257 to 47,872 bp 11,22-25. 

Genome sequencing of marine cyanophages has shown that many of them encode photosynthesis 
genes. All isolated cyanomyoviruses and more than half of the isolated cyanopodoviruses 
contain the key photosystem II reaction centre gene psbA in their genomes 11,13,17-19,23,26-29, whereas no psbA gene has been 
found among the known cyanosiphoviruses 20,30. Two recent studies showed that 24 of 39 marine cyanopodovirus 
isolates contained psbA 12 and that 8 of 12 sequenced cyanopodovirus genomes encoded psbA 13. In these two studies, 
the frequency of psbA-containing podoviruses was estimated from isolated cyanophages, which could be 
biased by the host used for isolation. Is it possible to quantify the presence of psbA in cyanopodoviruses in the 
ocean using a culture-independent approach? Metagenomic databases are a useful tool; however, the datasets 
in the public domain are also limited and may not represent the true community composition. 

Results 

In this study, we estimated the relative abundance and distribution of psbA-containing podoviruses based on 
metagenomic data. Our approach builds on the conserved genomic structure of cyanopodoviruses. 
The cyanopodovirus genome can be divided into three parts: structural genes, nucleotide 
metabolism-related genes and regions of hypothetical genes (Fig. 1) 11,18,22-24. Both the composition and the arrangement of 
the structural genes are conserved. One gene cluster, "portal-capsid-tail/fiber", exists in all cyanopodoviruses, as 
well as in other T7 phages 31. Interestingly, the psbA gene is commonly located at a fixed position within the 
conserved gene cluster "portal-psbA-capsid" 11. Based on this conserved gene cluster, we searched (BLAST) 
the GOS scaffold database using the portal, capsid assembly, psbA and major capsid protein (MCP) 
genes, and successfully retrieved 79 cyanopodoviral scaffolds. 

Figure 1 | The structure and organization of cyanopodoviruses and some scaffolds or contigs. 

Among the 79 cyanopodovirus scaffolds, 70 contain psbA and 9 
do not. All MCP sequences longer than 200 aa were used to construct 
the phylogenetic tree. The MCP-based phylogeny separated 
cyanopodoviruses into two major clades (Clade A and B) (Fig. 2), 
which is consistent with the phylogenetic relationship based on the 
DNA polymerase gene 10,12,21,32. Nearly all cyanopodoviruses in Clade 
B carry the psbA gene, whereas none of those in Clade A do (Fig. 2). A 
recent study also reported this psbA distribution pattern in 
cyanopodoviruses 12. 
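
The 89% figure quoted for psbA-containing scaffolds follows directly from the scaffold counts above. A minimal worked check, using only the two counts reported in the text (the variable names are illustrative):

```python
# Scaffold counts reported in the text for the GOS database.
with_psba = 70      # cyanopodovirus scaffolds that contain psbA
without_psba = 9    # cyanopodovirus scaffolds that lack psbA

total = with_psba + without_psba
fraction = with_psba / total

# Report the share of recruited scaffolds that carry psbA.
print(f"{with_psba}/{total} scaffolds carry psbA ({fraction:.1%})")
```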

In the Bermuda (BATS) database, 58 Clade B MCP homologs were 
recruited, but no Clade A MCP was found (Fig. 3A). Similarly, we recruited 17 
Clade B homologs but no Clade A homologs from the North Pacific 
(HOT) database (Fig. 3A). In the GOS database, 729 Clade B MCP 
homologs and 18 Clade A MCP homologs were found (Fig. 3A). 
Interestingly, 17 of the 18 Clade A reads were recruited from coastal water. 
It is likely that most Clade A-like sequences are from 
podoviruses infecting marine Synechococcus 10,33,34. In the MarineVirome 
database, 271 Clade B-like MCP sequences and 4 Clade A-like MCP 
sequences were detected (Fig. 3A). 
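
The read counts above can be tabulated to give the share of Clade A among all cyanopodoviral MCP reads in each database. The sketch below uses only the whole-database counts stated in the text. It does not reproduce the open-ocean versus coastal split, which would need per-site counts that are not given here. The function name `clade_a_share` is illustrative.

```python
# MCP homolog counts per database, as reported in the text (Fig. 3A).
counts = {
    "BATS": {"A": 0, "B": 58},
    "HOT": {"A": 0, "B": 17},
    "GOS": {"A": 18, "B": 729},
    "MarineVirome": {"A": 4, "B": 271},
}


def clade_a_share(clade_counts):
    """Return Clade A reads as a fraction of all cyanopodoviral MCP reads."""
    total = clade_counts["A"] + clade_counts["B"]
    return clade_counts["A"] / total if total else 0.0


shares = {name: clade_a_share(c) for name, c in counts.items()}
for name, share in shares.items():
    print(f"{name}: Clade A = {share:.2%} of MCP reads")
```

Because the read numbers were not normalized across sites, these shares are best read as within-database proportions rather than as abundances that can be compared between databases.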

Discussion 

Podoviruses in Clade A could be a transitional group between Clade 
B and other T7-like non-cyanobacterial podoviruses (Fig. 2). Four 
scaffolds in Clade B do not contain psbA; the gene may have been 
lost from these scaffolds during evolution. Interestingly, 
scaffold JCVI_SCAF_109662694693 (in Clade B) contains a high 
light-induced gene (hli), but no psbA. 

Our analysis suggests that Clade A podoviruses make up only a 
very small proportion of cyanopodoviruses in the surface ocean. In 
the open ocean, Clade A podoviruses account for only 0.27% and 
1.12% of all cyanopodoviruses in the GOS and MarineVirome 
databases, respectively. In coastal surface water, Clade A podoviruses 
make up 8.02% and 14.29% of total cyanopodoviruses in the GOS 
and MarineVirome databases, respectively (Fig. 3B). Clade A 
podoviruses were not detected at the two open ocean stations, BATS and 
HOT. Clade A mainly consists of the psbA-lacking podoviruses 
that infect marine Synechococcus 10,12. Our study suggests that 
carrying the psbA gene may be less important for cyanophages in 
coastal or estuarine environments than for cyanophages in the 
open ocean. Sullivan and colleagues also suggested that a shorter latent 
period could explain the lack of the psbA gene, because a shorter 
infection does not need the help of psbA 23. 

Figure 2 | The neighbor-joining tree based on the MCP sequences. 
Sequences shown in red are scaffolds or cyanophage genomes without 
psbA genes. Values of >50% are shown and indicate percentage bootstrap 
support based on 1000 replicates for distance (neighbor-joining, NJ), 
maximum parsimony (MP) and minimum evolution (ME) analyses, in the 
order NJ/MP/ME. Scale bar, 0.1 nucleotide substitution per site. 

Figure 3 | Number and distribution pattern of cyanopodoviral major 
capsid reads in the databases. A, Read counts of major capsid reads 
corresponding to Clade A and B in four metagenomic databases: BATS, 
GOS, HOT and MarineVirome. B, Proportion of reads belonging to Clade 
A in open ocean and coastal water, respectively. 

The metagenomic recruitment based on the unique portal-capsid 
structure provides a culture-independent survey of the distribution 
frequency of psbA-carrying cyanopodoviruses. However, all of the 
datasets analyzed were derived mainly from the surface 
ocean. Our analysis suggests that: 1) psbA-carrying cyanopodoviruses are 
the dominant cyanopodoviruses in the surface ocean; 2) Synechococcus 
podoviruses become relatively more abundant in coastal 
water; 3) psbA is more important for oceanic cyanopodoviruses than 
for their coastal counterparts. 

Methods 

Metagenomics. Four metagenomic databases were searched for homologs in our 
study: three from the bacterial fraction, namely the Global Ocean Survey database (GOS) 35, 
the Bermuda database (BATS) 36 and the Hawaii Ocean Time-Series (HOT) 37,38, and one 
from the viral fraction, the MarineVirome 39. All databases were obtained from the 
CAMERA website (http://camera.calit2.net/index.shtm). 



Based on the conserved cyanopodovirus gene cluster "portal-psbA-capsid", 
we searched (BLAST) the GOS scaffold database with the portal, capsid assembly, 
psbA and major capsid protein (MCP) genes, using a reciprocal best-hit BLAST 
strategy with no e-value cutoff (Fig. 1) 40. The structural genes (portal, MCP 
or capsid assembly gene) allowed the identification of cyanopodoviruses by searching 
against the NCBI non-redundant protein database. 
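
The reciprocal best-hit idea can be sketched as follows. This is a minimal illustration under stated assumptions, not the pipeline used in the study: hits are assumed to be already parsed into (query, subject, bit score) tuples, the best hit is taken to be the one with the highest bit score, and all sequence names and scores in the example are hypothetical.

```python
def best_hits(hits):
    """Map each query to its highest-scoring subject.

    `hits` is an iterable of (query, subject, bit_score) tuples.
    """
    best = {}
    for query, subject, score in hits:
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(forward, reverse):
    """Return (query, subject) pairs that are each other's best hit."""
    fwd = best_hits(forward)
    rev = best_hits(reverse)
    return sorted((q, s) for q, s in fwd.items() if rev.get(s) == q)


# Hypothetical hit tables: reference genes searched against scaffolds
# (forward) and scaffolds searched back against a protein database (reverse).
forward = [
    ("MCP_ref", "scaffold_1", 310.0),
    ("MCP_ref", "scaffold_2", 95.0),
    ("portal_ref", "scaffold_2", 280.0),
]
reverse = [
    ("scaffold_1", "MCP_ref", 305.0),
    ("scaffold_2", "other_phage_gene", 150.0),
    ("scaffold_2", "portal_ref", 120.0),
]

pairs = reciprocal_best_hits(forward, reverse)
print(pairs)
```

In this toy example only the MCP reference and scaffold_1 are retained, because the best reverse hit of scaffold_2 is a different gene. Real scaffolds carry several genes, so a practical version would track hits per gene rather than per scaffold.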

To analyze the occurrence frequency and geographic pattern of cyanopodoviruses 
in the ocean, we recruited reads from the BATS, GOS, HOT and MarineVirome datasets 
using all MCP sequences from sequenced cyanopodoviral genomes as published in 
Labrie's paper 11,13. Our approach is similar to the methods described by Zhao et al. 40,41. 
Briefly, all homologous reads were recruited from binning by e-value cutoff to avoid 
potential bias, and then each putative hit was extracted and used as a query to search 
against the NCBI non-redundant protein database 42. The best hit returned for each 
metagenomic sequence was used to confirm its classification, and all identified 
reads are listed in Table S1. The number of recruited reads was not normalized, 
because the sampling method differs among the sites and does not target 
viruses. However, none of the sampling methods should bias recovery toward 
cyanopodoviruses with or without the psbA gene. 

Phylogenetic analyses. All MCP sequences longer than 200 aa were used to construct the 
phylogenetic tree. Sequences were aligned using Clustal X, and phylogenetic trees were 
constructed using the neighbor-joining, minimum-evolution and maximum-parsimony 
algorithms of MEGA software 3.0 42. Support for the phylogenetic trees was 
assessed by bootstrap resampling with 1000 replicates. 
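
Bootstrap resampling of an alignment draws alignment columns with replacement to build each replicate, from which a distance matrix and tree are then recomputed. The sketch below illustrates only the resampling step and a simple p-distance. It is not the MEGA implementation, and the toy alignment and sequence names are hypothetical.

```python
import random


def bootstrap_alignment(alignment, rng):
    """Build one bootstrap replicate by resampling columns with replacement."""
    length = len(next(iter(alignment.values())))
    columns = [rng.randrange(length) for _ in range(length)]
    return {
        name: "".join(seq[i] for i in columns)
        for name, seq in alignment.items()
    }


def p_distance(a, b):
    """Proportion of differing sites, ignoring positions with a gap."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    return sum(x != y for x, y in pairs) / len(pairs)


# Hypothetical toy alignment of three short MCP fragments (equal length).
alignment = {
    "phage_1": "MKTAYLLG",
    "phage_2": "MKSAYLLG",
    "scaffold_1": "MRSA-LIG",
}

rng = random.Random(0)  # fixed seed so the sketch is repeatable
replicates = [bootstrap_alignment(alignment, rng) for _ in range(1000)]

# Pairwise distance between two sequences in the first replicate.
first = replicates[0]
print(p_distance(first["phage_1"], first["phage_2"]))
```

Bootstrap support for a clade is then the percentage of replicate trees in which that clade appears, which is the value shown on the branches in Fig. 2.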